Meet the penguins

The palmerpenguins data contains size measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica.

The Palmer Archipelago penguins. Artwork by @allison_horst.

These data were collected from 2007 to 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program, part of the US Long Term Ecological Research Network. The data were imported directly from the Environmental Data Initiative (EDI) Data Portal, and are available under a CC0 license (“No Rights Reserved”) in accordance with the Palmer Station Data Policy.

Installation

You can install the released version of palmerpenguins from CRAN with:

install.packages("palmerpenguins")

Or install the development version from GitHub with:

# install.packages("remotes")
remotes::install_github("allisonhorst/palmerpenguins")

The palmerpenguins package

This package contains two datasets:

  1. A curated subset of the raw data, named penguins. This is the dataset we’ll focus on here.

  2. The raw data, accessed from the Environmental Data Initiative (see full data citations below), is also available as palmerpenguins::penguins_raw.

The curated palmerpenguins::penguins dataset contains 8 variables (n = 344 penguins). You can read more about the variables by typing ?penguins.
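The examples on this page call functions from dplyr and ggplot2 without showing where they are attached. A minimal setup sketch, assuming all three packages are already installed (the choice to attach dplyr and ggplot2 individually rather than through tidyverse is an assumption, not something the original states):

```r
# Assumption: palmerpenguins, dplyr and ggplot2 are installed.
library(palmerpenguins)  # provides penguins and penguins_raw
library(dplyr)           # glimpse(), mutate(), summarize(), the %>% pipe
library(ggplot2)         # plotting functions and remove_missing()

# Quick sanity check on the curated data
dim(penguins)    # number of rows and columns
names(penguins)  # variable names
```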

glimpse(penguins)
#> Rows: 344
#> Columns: 8
#> $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
#> $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
#> $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
#> $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ sex               <fct> male, female, female, NA, female, male, female, male…
#> $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

The palmerpenguins::penguins data contains 333 complete cases. The remaining 11 rows hold 19 missing values in total.
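These counts can be checked with base R. The sketch below is one way to do it; the object names n_complete, n_missing and na_by_col are only illustrative:

```r
library(palmerpenguins)

# Rows with no NA in any column
n_complete <- sum(complete.cases(penguins))

# Individual NA cells across the whole data frame
n_missing <- sum(is.na(penguins))

# NA cells broken down by variable
na_by_col <- colSums(is.na(penguins))

n_complete
n_missing
na_by_col
```

The per-column breakdown shows which variables account for the missing values, which helps when deciding whether to drop rows or pass na.rm = TRUE to individual summaries.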

Challenge! Let’s find the smallest penguin (by body mass) observed in each species. Ties are kept, so a species can return more than one row.

penguins %>% 
  group_by(species) %>% 
  filter(body_mass_g == min(body_mass_g, na.rm = TRUE))
#> # A tibble: 4 × 8
#> # Groups:   species [3]
#>   species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>   <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
#> 1 Adelie    Biscoe           36.5          16.6               181        2850
#> 2 Adelie    Biscoe           36.4          17.1               184        2850
#> 3 Gentoo    Biscoe           42.7          13.7               208        3950
#> 4 Chinstrap Dream            46.9          16.6               192        2700
#> # ℹ 2 more variables: sex <fct>, year <int>
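The same question can also be answered with dplyr::slice_min(), which sorts missing values to the end, so no explicit na.rm is needed. This is a sketch of an alternative, not the approach used above; the name lightest is illustrative:

```r
library(palmerpenguins)
library(dplyr)

lightest <- penguins %>%
  group_by(species) %>%
  # with_ties = TRUE (the default) keeps every row tied for the minimum
  slice_min(body_mass_g, n = 1, with_ties = TRUE) %>%
  ungroup()

lightest %>%
  select(species, island, body_mass_g)
```

Setting with_ties = FALSE would instead return exactly one row per species, picking the first of any tied rows.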

Bill dimensions

The culmen is the upper ridge of a bird’s bill. In the simplified penguins data, culmen length and depth are renamed as variables bill_length_mm and bill_depth_mm to be more intuitive.
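To see the renaming, you can compare column names between the raw and curated data. A small sketch using base R pattern matching:

```r
library(palmerpenguins)

# Column names in the raw data that mention the culmen
raw_culmen <- grep("Culmen", names(penguins_raw), value = TRUE)

# Their renamed counterparts in the curated data
curated_bill <- grep("^bill_", names(penguins), value = TRUE)

raw_culmen
curated_bill
```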

For this penguin data, the culmen (bill) length and depth are measured as shown below (thanks Kristen Gorman for clarifying!):

Practice mutating – let’s create a new column that has bill size (area, in square millimeters).


penguins %>% 
  mutate(bill_size_mm2 = bill_depth_mm * bill_length_mm) %>% 
  head()
#> # A tibble: 6 × 9
#>   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
#> 1 Adelie  Torgersen           39.1          18.7               181        3750
#> 2 Adelie  Torgersen           39.5          17.4               186        3800
#> 3 Adelie  Torgersen           40.3          18                 195        3250
#> 4 Adelie  Torgersen           NA            NA                  NA          NA
#> 5 Adelie  Torgersen           36.7          19.3               193        3450
#> 6 Adelie  Torgersen           39.3          20.6               190        3650
#> # ℹ 3 more variables: sex <fct>, year <int>, bill_size_mm2 <dbl>

Let’s select all columns that contain measurements in mm.

penguins %>% 
  select(ends_with("mm"))
#> # A tibble: 344 × 3
#>    bill_length_mm bill_depth_mm flipper_length_mm
#>             <dbl>         <dbl>             <int>
#>  1           39.1          18.7               181
#>  2           39.5          17.4               186
#>  3           40.3          18                 195
#>  4           NA            NA                  NA
#>  5           36.7          19.3               193
#>  6           39.3          20.6               190
#>  7           38.9          17.8               181
#>  8           39.2          19.6               195
#>  9           34.1          18.1               193
#> 10           42            20.2               190
#> # ℹ 334 more rows

The contains() helper selects the same columns here, because "mm" appears only in the names of the three millimeter measurements.

penguins %>% 
  select(contains("mm"))
#> # A tibble: 344 × 3
#>    bill_length_mm bill_depth_mm flipper_length_mm
#>             <dbl>         <dbl>             <int>
#>  1           39.1          18.7               181
#>  2           39.5          17.4               186
#>  3           40.3          18                 195
#>  4           NA            NA                  NA
#>  5           36.7          19.3               193
#>  6           39.3          20.6               190
#>  7           38.9          17.8               181
#>  8           39.2          19.6               195
#>  9           34.1          18.1               193
#> 10           42            20.2               190
#> # ℹ 334 more rows
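The two helpers agree on this dataset, but they are not interchangeable in general: ends_with() matches only a suffix, while contains() matches the string anywhere in the name. A quick sketch to confirm that both pick the same columns here (the object names are illustrative):

```r
library(palmerpenguins)
library(dplyr)

cols_ends     <- penguins %>% select(ends_with("mm")) %>% names()
cols_contains <- penguins %>% select(contains("mm")) %>% names()

# Do both helpers select exactly the same columns, in the same order?
identical(cols_ends, cols_contains)
```

Using the stricter pattern ends_with("_mm") is a reasonable default when the unit is encoded as a suffix, since it will not pick up an unrelated column that happens to contain "mm".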

Let’s find the median body mass for each species (using mutate()).

penguins %>% 
  remove_missing() %>% 
  group_by(species) %>% 
  mutate(body_mass_median = median(body_mass_g))
#> # A tibble: 333 × 9
#> # Groups:   species [3]
#>    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
#>  1 Adelie  Torgersen           39.1          18.7               181        3750
#>  2 Adelie  Torgersen           39.5          17.4               186        3800
#>  3 Adelie  Torgersen           40.3          18                 195        3250
#>  4 Adelie  Torgersen           36.7          19.3               193        3450
#>  5 Adelie  Torgersen           39.3          20.6               190        3650
#>  6 Adelie  Torgersen           38.9          17.8               181        3625
#>  7 Adelie  Torgersen           39.2          19.6               195        4675
#>  8 Adelie  Torgersen           41.1          17.6               182        3200
#>  9 Adelie  Torgersen           38.6          21.2               191        3800
#> 10 Adelie  Torgersen           34.6          21.1               198        4400
#> # ℹ 323 more rows
#> # ℹ 3 more variables: sex <fct>, year <int>, body_mass_median <dbl>

Let’s find the median body mass for each species (using summarize()).

penguins %>% 
  remove_missing() %>% 
  group_by(species) %>% 
  summarize(body_mass_median = median(body_mass_g))
#> # A tibble: 3 × 2
#>   species   body_mass_median
#>   <fct>                <dbl>
#> 1 Adelie                3700
#> 2 Chinstrap             3700
#> 3 Gentoo                5050
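Note that remove_missing() comes from ggplot2 and drops every row with a missing value in any column, including rows where only sex is missing. A sketch of an alternative that keeps those rows and ignores missing body masses only; the resulting medians are computed on more penguins, so they may differ from the table above:

```r
library(palmerpenguins)
library(dplyr)

mass_medians <- penguins %>%
  group_by(species) %>%
  summarize(
    n_measured       = sum(!is.na(body_mass_g)),          # penguins with a recorded mass
    body_mass_median = median(body_mass_g, na.rm = TRUE)  # ignore missing masses only
  )

mass_medians
```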

Let’s find the median of every numeric variable! This time we also group by year.

penguins %>% 
  remove_missing() %>% 
  group_by(species, year) %>% 
  summarize(across(where(is.numeric), median))
#> # A tibble: 9 × 6
#> # Groups:   species [3]
#>   species    year bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>   <fct>     <int>          <dbl>         <dbl>             <dbl>       <dbl>
#> 1 Adelie     2007           39            18.6              186         3675
#> 2 Adelie     2008           38.6          18.3              190         3700
#> 3 Adelie     2009           38.7          18.0              191         3600
#> 4 Chinstrap  2007           48.8          18.2              194.        3700
#> 5 Chinstrap  2008           49.2          18.5              198.        3750
#> 6 Chinstrap  2009           50.0          18.6              198         3675
#> 7 Gentoo     2007           46.7          14.6              215         5050
#> 8 Gentoo     2008           46.4          15                219         5000
#> 9 Gentoo     2009           48.8          15.2              218         5200

Let’s create a new column that classifies bill size into two categories – big or small.

threshold <- 800 ### first define a threshold to distinguish big from small
penguins %>% 
  mutate(bill_size_mm2 = bill_depth_mm * bill_length_mm,
         bill_size_binary = ifelse(bill_size_mm2 > threshold, "big", "small")) %>% 
  select(bill_size_binary, bill_size_mm2, everything()) %>% 
  head()
#> # A tibble: 6 × 10
#>   bill_size_binary bill_size_mm2 species island    bill_length_mm bill_depth_mm
#>   <chr>                    <dbl> <fct>   <fct>              <dbl>         <dbl>
#> 1 small                     731. Adelie  Torgersen           39.1          18.7
#> 2 small                     687. Adelie  Torgersen           39.5          17.4
#> 3 small                     725. Adelie  Torgersen           40.3          18  
#> 4 <NA>                       NA  Adelie  Torgersen           NA            NA  
#> 5 small                     708. Adelie  Torgersen           36.7          19.3
#> 6 big                       810. Adelie  Torgersen           39.3          20.6
#> # ℹ 4 more variables: flipper_length_mm <int>, body_mass_g <int>, sex <fct>,
#> #   year <int>
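To see how the classification splits the whole dataset, the new column can be tallied with count(). This sketch swaps in dplyr::if_else(), a stricter variant of ifelse() that checks that both outcomes have the same type. The threshold of 800 is the same illustrative cutoff as above, and bill_counts is a hypothetical name:

```r
library(palmerpenguins)
library(dplyr)

threshold <- 800  # illustrative cutoff in square millimeters

bill_counts <- penguins %>%
  mutate(bill_size_mm2    = bill_depth_mm * bill_length_mm,
         bill_size_binary = if_else(bill_size_mm2 > threshold, "big", "small")) %>%
  count(bill_size_binary)  # rows with missing bill measurements are counted under NA

bill_counts
```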

Exploring factors

The penguins data has three factor variables:

penguins %>%
  dplyr::select(where(is.factor)) %>% 
  glimpse()
#> Rows: 344
#> Columns: 3
#> $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie…
#> $ island  <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgers…
#> $ sex     <fct> male, female, female, NA, female, male, female, male, NA, NA, …

# Count penguins for each species / island
penguins %>%
  count(species, island, .drop = FALSE)
#> # A tibble: 9 × 3
#>   species   island        n
#>   <fct>     <fct>     <int>
#> 1 Adelie    Biscoe       44
#> 2 Adelie    Dream        56
#> 3 Adelie    Torgersen    52
#> 4 Chinstrap Biscoe        0
#> 5 Chinstrap Dream        68
#> 6 Chinstrap Torgersen     0
#> 7 Gentoo    Biscoe      124
#> 8 Gentoo    Dream         0
#> 9 Gentoo    Torgersen     0

ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(alpha = 0.8) +
  scale_fill_manual(values = c("darkorange","purple","cyan4"),
                    guide = "none") +  # guide = FALSE is deprecated in recent ggplot2
  theme_minimal() +
  facet_wrap(~species, ncol = 1) +
  coord_flip()

# Count penguins for each species / sex
penguins %>%
  count(species, sex, .drop = FALSE)
#> # A tibble: 8 × 3
#>   species   sex        n
#>   <fct>     <fct>  <int>
#> 1 Adelie    female    73
#> 2 Adelie    male      73
#> 3 Adelie    <NA>       6
#> 4 Chinstrap female    34
#> 5 Chinstrap male      34
#> 6 Gentoo    female    58
#> 7 Gentoo    male      61
#> 8 Gentoo    <NA>       5

ggplot(penguins, aes(x = sex, fill = species)) +
  geom_bar(alpha = 0.8) +
  scale_fill_manual(values = c("darkorange","purple","cyan4"),
                    guide = "none") +  # guide = FALSE is deprecated in recent ggplot2
  theme_minimal() +
  facet_wrap(~species, ncol = 1) +
  coord_flip()

# Penguins are fun to summarize!
penguins %>% 
  count(species)
#> # A tibble: 3 × 2
#>   species       n
#>   <fct>     <int>
#> 1 Adelie      152
#> 2 Chinstrap    68
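#> 3 Gentoo      124

penguins %>% 
  group_by(species) %>% 
  # Pass na.rm inside a lambda; supplying it through across()'s ... is deprecated in dplyr 1.1.0+
  summarize(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))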
#> # A tibble: 3 × 6
#>   species   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
#>   <fct>              <dbl>         <dbl>             <dbl>       <dbl> <dbl>
#> 1 Adelie              38.8          18.3              190.       3701. 2008.
#> 2 Chinstrap           48.8          18.4              196.       3733. 2008.
#> 3 Gentoo              47.5          15.0              217.       5076. 2008.

Exploring scatterplots

penguins %>%
  dplyr::select(body_mass_g, ends_with("_mm")) %>% 
  glimpse()
#> Rows: 344
#> Columns: 4
#> $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…

# Scatterplot example 1: penguin flipper length versus body mass
ggplot(data = penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, 
                 shape = species),
             size = 2) +
  scale_color_manual(values = c("darkorange","darkorchid","cyan4")) 


# Scatterplot example 2: penguin bill length versus bill depth
ggplot(data = penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species, 
                 shape = species),
             size = 2)  +
  scale_color_manual(values = c("darkorange","darkorchid","cyan4"))

You can add color and/or shape aesthetics in ggplot2 to layer in factor levels like we did above. With three factor variables to work with, you can add another factor layer with facets, like the plot below.

ggplot(penguins, aes(x = flipper_length_mm,
                     y = body_mass_g)) +
  geom_point(aes(color = sex)) +
  scale_color_manual(values = c("darkorange","cyan4"), 
                     na.translate = FALSE) +
  facet_wrap(~species)

Exploring distributions

# Jitter plot example: bill length by species
ggplot(data = penguins, aes(x = species, y = bill_length_mm)) +
  geom_jitter(aes(color = species),
              width = 0.1, 
              alpha = 0.7,
              show.legend = FALSE) +
  scale_color_manual(values = c("darkorange","darkorchid","cyan4"))


# Histogram example: flipper length by species
ggplot(data = penguins, aes(x = flipper_length_mm)) +
  geom_histogram(aes(fill = species), alpha = 0.5, position = "identity") +
  scale_fill_manual(values = c("darkorange","darkorchid","cyan4"))

References

Data originally published in:

Individual datasets:

Individual data can be accessed directly via the Environmental Data Initiative: